An Effective Grammar-Based Compression Algorithm for Tree Structured Data

Authors

  • Kazunori Yamagata
  • Tomoyuki Uchida
  • Takayoshi Shoudai
  • Yasuaki Nakamura
Abstract

Many semistructured data such as HTML/XML files are represented by rooted trees t such that all children of each internal vertex of t are ordered and all edges of t have labels. Such data is called tree structured data. Analyzing large tree structured data is a time-consuming process in data mining. If we can reduce the size of input data without loss of information, we can speed up such a heavy process. In this paper, we consider a problem of effective compression of an ordered rooted tree, which represents given tree structured data, without loss of information. Firstly, in order to define this problem in a grammar-based compression scheme, we present a variable replacement grammar (VRG for short) over ordered rooted trees. The grammar-based compression problem for an ordered rooted tree T is defined as a problem of finding a VRG which generates only T and whose size is minimum. For the grammar-based compression problem for an ordered rooted tree, we show that there is no polynomial time algorithm with approximation ratio less than 8593/8592 unless P=NP. Secondly, based on this theoretical result, we present an effective compression algorithm for finding a VRG which generates only a given ordered rooted tree and whose size is as small as possible. Finally, in order to evaluate the performance of our grammar-based compression algorithm, we report some experimental results.
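
The abstract does not spell out the VRG formalism, so the following is only a simplified illustration of the general idea of grammar-based compression for edge-labeled ordered trees. It shares identical subtrees by creating one rule per distinct subtree, which is a restricted special case and not the authors' algorithm. The tuple encoding of trees and the names `compress`, `expand`, `edge_count` and `grammar_size` are assumptions made for this sketch.

```python
from typing import Dict, Tuple

# Assumed encoding: a node is a tuple of (edge_label, child_node) pairs,
# listed in child order. A leaf is the empty tuple ().
Tree = tuple


def compress(tree: Tree) -> Tuple[Dict[int, tuple], int]:
    """Build one rule per distinct subtree and return (rules, start_rule_id)."""
    ids: Dict[tuple, int] = {}    # rule body -> rule id
    rules: Dict[int, tuple] = {}  # rule id -> rule body

    def visit(node: Tree) -> int:
        # A rule body is the node's edge list with each child replaced
        # by the id of the rule that generates that child.
        body = tuple((label, visit(child)) for label, child in node)
        if body not in ids:
            ids[body] = len(ids)
            rules[ids[body]] = body
        return ids[body]

    start = visit(tree)
    return rules, start


def expand(rules: Dict[int, tuple], start: int) -> Tree:
    """Regenerate the single tree described by the rules (lossless)."""
    return tuple((label, expand(rules, child)) for label, child in rules[start])


def edge_count(tree: Tree) -> int:
    """Size of the original tree, measured in edges."""
    return sum(1 + edge_count(child) for _, child in tree)


def grammar_size(rules: Dict[int, tuple]) -> int:
    """Size of the rule set, measured as total edges over all rule bodies."""
    return sum(len(body) for body in rules.values())


# A small XML-like document with three identical <item> subtrees.
item = (("name", ()), ("price", ()))
doc = (("item", item), ("item", item), ("item", item))
rules, start = compress(doc)
```

In this example `expand(rules, start)` reproduces `doc` exactly, while the repeated `item` subtree is stored only once. The recursion is written for clarity and would need an iterative form for very deep trees.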

Similar resources

Efficient Statistical Modeling for the Compression of Tree Structured Intermediate Code

We propose a scheme for the compression of tree structured intermediate code consisting of a sequence of trees specified by a regular tree grammar. The scheme is based on arithmetic coding, and the model that works in conjunction with the coder is automatically generated from the syntactical specification of the tree language. Experiments on data sets consisting of intermediate code trees yield...

A Tree Structured Bayesian Scalar Quantizer for Wavelet Based Image Compression

Multiresolution image decompositions (e.g., wavelets), in conjunction with a variety of quantization schemes, have been shown to be very effective for image compression. Recently, several promising tree-structured quantization schemes that exploit the correlation across scales have been proposed. In this paper, we present an image compression al...

Approximation of smallest linear tree grammar

A simple linear-time algorithm for constructing a linear context-free tree grammar of size O(rg + rg log(n/rg)) for a given input tree T of size n is presented, where g is the size of a minimal linear context-free tree grammar for T, and r is the maximal rank of symbols in T (which is a constant in many applications). This is the first example of a grammar-based tree compression algorithm with...

Joint Optimization of Scalar and Tree-structured Quantization of Wavelet Image Decompositions

Wavelet image decompositions generate a tree-structured set of coefficients, providing a hierarchical data structure for representing images. While early wavelet-based algorithms for image compression concentrated on optimal quantization of wavelet coefficients, several recent researchers have proposed approaches which couple coefficient quantization (either scalar or vector-based) with various str...

Data Compression-Based Approaches to Analysis of Biological Networks

Data compression-based methods have been effectively applied to analysis of biological sequences and protein structures. We have been extending this approach for analysis of biological networks. We are also developing grammar-based compression algorithms for structured data. In this extended abstract, we briefly review these approaches.

Publication date: 2003